Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

A DNA sequence model built on Mamba, a state-space alternative to the Transformer. It is efficient and performs strongly for its small size. #gene #genetics
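The "equivariant" in the Caduceus title refers to reverse-complement symmetry: a DNA strand and its reverse complement describe the same molecule, so predictions on one should mirror predictions on the other. The sketch below is only a toy illustration of that property, not the Caduceus architecture; `strong_weak_track` is a hypothetical per-base function chosen because it happens to satisfy the symmetry.

```python
# Toy illustration of reverse-complement (RC) symmetry.
# Nothing here is taken from the Caduceus codebase.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def strong_weak_track(seq: str) -> list:
    """Hypothetical per-base output: 1 for G/C (strong pair), 0 for A/T."""
    return [1 if base in "GC" else 0 for base in seq]


def is_rc_equivariant(per_base_fn, seq: str) -> bool:
    """Check that running on the RC strand gives the reversed output."""
    forward = per_base_fn(seq)
    backward = per_base_fn(reverse_complement(seq))
    return backward == forward[::-1]


example = "ATGC"
rc_example = reverse_complement(example)
```

A real RC-equivariant model builds this symmetry into its layers rather than checking it after the fact; the check above only shows what the property means.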

HyenaDNA is a Stanford-built model that predicts properties of the human genome. Rather than a Transformer, it uses the attention-free Hyena operator.

We’re excited to introduce HyenaDNA, a long-range genomic foundation model with context lengths of up to 1 million tokens at single nucleotide resolution!
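"Single nucleotide resolution" means one token per base, as opposed to grouping bases into k-mers. A minimal sketch of the difference follows; the vocabulary ids and function names are assumptions made for illustration and are not HyenaDNA's actual tokenizer.

```python
# Hypothetical vocabulary; the real HyenaDNA token ids may differ.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize_single_nucleotide(seq: str) -> list:
    """One token per base; unknown characters map to N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def tokenize_kmers(seq: str, k: int = 6) -> list:
    """Non-overlapping k-mers; any trailing partial k-mer is dropped."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


sequence = "ACGTACGTACGT"
per_base = tokenize_single_nucleotide(sequence)  # 12 tokens for 12 bases
per_kmer = tokenize_kmers(sequence, k=6)         # 2 tokens for 12 bases
```

Per-base tokens make every single-base change visible to the model, at the cost of much longer sequences, which is why long context lengths matter here.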

Biological simulation

  • Whole-body simulation of realistic fruit fly locomotion with deep reinforcement learning, HHMI Janelia, Google DeepMind. Introduces an anatomically detailed, biomechanical whole-body model of a fruit fly for simulating realistic locomotion behaviors, such as flying and walking. Developed with the open source MuJoCo physics engine, the model includes sophisticated representations of the fly’s body parts, fluid forces during flight, and adhesion forces. Deep reinforcement learning is used to train neural network controllers that drive the simulated fly along complex trajectories and through tasks based on sensory inputs, achieving high fidelity in locomotion simulation.

Proteins

Google has released AlphaFold 3. Google DeepMind and Isomorphic Labs have developed the third generation of AlphaFold, a powerful protein structure prediction model. It can now predict the 3D structure and interactions of a broad range of biomolecules, including proteins, DNA, RNA, and ligands. Google reports at least a 50% improvement over existing methods for predicting interactions between proteins and other molecule types. It correctly predicted the folded structure of the spike protein of the coronavirus OC43.

Access AlphaFold 3’s capabilities for free using the AlphaFold server.

Also see Jing, Berger, and Jaakkola (2024)

AlphaFold predicts the structure of a protein after folding. Adding flow matching, which learns an invertible flow from noise to data, turns it into a generative model that can sample ensembles of conformations, dramatically improving modeling power across the entire landscape of proteins.
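For readers unfamiliar with flow matching, here is a minimal generic sketch of the idea, assuming the common straight-line path between a noise sample `x0` and a data sample `x1`. It is not the method of Jing, Berger, and Jaakkola (2024); all function names are illustrative, and `model` stands for any callable that returns a velocity.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (noise) to x1 (data) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model, x0, x1, t):
    """Mean squared error between the model's velocity and the target at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = model(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(model, x0, steps=10):
    """Generate a sample by integrating the velocity field from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Sanity check with an oracle that already knows the right velocity.
noise, data = [0.0, 0.0], [1.0, -2.0]
oracle = lambda x, t: target_velocity(noise, data)
sample = euler_sample(oracle, noise, steps=10)
```

In training, `model` would be a neural network, `t` would be drawn at random, and the loss would be averaged over many noise and data pairs; the oracle here only verifies that integrating the target velocity carries noise to data.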

Sequence modeling and design from molecular to genome scale with Evo. Samuel Hammond thinks this is a really big deal.

Patrick Hsu: To aid our model design and scaling, we performed the first scaling laws analysis on DNA pretraining (to our knowledge) across leading architectures (Transformer++, Mamba, Hyena, and StripedHyena), training over 300 models from 6M to 1B parameters at increasing compute budgets

Evo is a protein language model, an RNA language model, and a regulatory DNA model 🤯

Evo can do prediction and generation across all 3 of these modalities. We show zero-shot function prediction across DNA, RNA, and protein modalities.
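A scaling-laws analysis like the one mentioned above typically fits a power law relating loss to compute. The sketch below shows the fitting step only, under the assumption of a simple form loss = a * compute^(-b). The data points are synthetic, generated from a known power law purely to exercise the code; they are not Evo results.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) as a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = (
        sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
        / sum((u - mean_x) ** 2 for u in log_x)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic data from a known power law (a=5.0, b=0.05); NOT measured values.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [5.0 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)
```

Comparing the fitted exponent `b` across architectures is one way to say which architecture makes better use of additional compute.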

Samuel Hammond: SoTA zero-shot protein function prediction from a 7b parameter model. This alone justifies NVDA’s valuation. Every big pharma company is about to start pouring capex into training runs of their own. Text-to-organism is not far. If you doubted the Great Stagnation was over!

Challenges and Counterarguments

The current science is based on a simple, straightforward model that assumes proteins generally fold only one way. But take A Holistic View of the Cell and it’s clear that life is far more complicated:

AlphaFold2, the computational model that predicts protein structures with an accuracy that matches or exceeds experimental methods, was trained mostly on protein structures solved with x-ray crystallography. But again, proteins in cells behave more like liquids than solids; they wiggle to-and-fro in a chaotic dance, and can adopt hundreds of different, distinct shapes.

If one reverses AlphaFold’s predictions, and instead makes the model generative, it tends to design proteins that are hyper-stable and rigid, much like the frozen proteins on which it was trained. This is part of the reason why it will be so difficult to design new functional proteins—AI models are not trained on a complete biological picture.


Cradle raised $47M for Protein engineering without the guesswork: “Design improved variants of your target protein sequence with just a few clicks — and some machine learning.”

Ex-Meta researchers founded EvolutionaryScale, raising $40M at a $200M valuation to fast-track AI-driven protein structure prediction. With claims of outpacing Google’s AlphaFold in speed, the startup eyes breakthroughs in medicine and biotech.

Microsoft’s open source EvoDiff framework is a 640-million parameter model trained on data from all different species and functional classes of proteins. The data to train the model was sourced from the OpenFold data set for sequence alignments and UniRef50, a subset of data from UniProt, the database of protein sequence and functional information maintained by the UniProt consortium (Alamdari et al. 2023).


Google DeepMind’s AlphaMissense is a freely available AI catalog that has classified the potential effects of millions of missense genetic mutations, which could help establish the cause of diseases such as cystic fibrosis, sickle-cell anemia, and cancer.

The AlphaMissense resource from Google DeepMind classified 89% of all 71 million possible missense variants as either likely pathogenic or likely benign.

Chemistry

ChemFlow: Navigating Chemical Space with Latent Flows. Enhance molecular science by efficiently navigating chemical space using deep generative models.

Navigating Chemical Space with Latent Flows by Guanghao Wei, Yining Huang, Chenru Duan, Yue Song, and Yuanqi Du.

Flows can uncover meaningful structures of latent spaces learned by generative models! We propose a unifying framework to characterize latent structures by flows/diffusions for optimization and traversal.

CRISPR editing

A Stanford-Princeton team that includes Russ Altman designed a system intended to simplify the complex task of gene editing with CRISPR.

Because so many different tasks are involved, this approach uses agents that each handle different aspects of the problem.

Huang et al. (2024)

CRISPR-GPT leverages the reasoning ability of LLMs to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing outcomes
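The division of labor described above can be pictured as a pipeline in which each agent owns one task and writes its decision into a shared plan. The following is a hypothetical sketch of that pattern only. It does not reproduce CRISPR-GPT's code or API, and the stub agents return placeholder text where a real system would call an LLM or a design tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ExperimentPlan:
    """Shared state that every agent can read; names are illustrative."""
    goal: str
    decisions: Dict[str, str] = field(default_factory=dict)


Agent = Callable[[ExperimentPlan], str]

# Task list taken from the description above.
TASKS = [
    "select CRISPR system",
    "design guide RNAs",
    "recommend delivery method",
    "draft protocol",
    "design validation experiment",
]


def make_stub_agent(task: str) -> Agent:
    """Placeholder agent; a real one would reason about the plan so far."""
    def agent(plan: ExperimentPlan) -> str:
        return f"[{task}] proposal for: {plan.goal}"
    return agent


def run_pipeline(goal: str, agents: Dict[str, Agent]) -> ExperimentPlan:
    """Run agents in order; later agents can see earlier decisions."""
    plan = ExperimentPlan(goal)
    for task, agent in agents.items():
        plan.decisions[task] = agent(plan)
    return plan


plan = run_pipeline("knock out gene X", {t: make_stub_agent(t) for t in TASKS})
```

Keeping each task behind its own function makes it possible to swap in a specialized tool for one step, such as guide RNA design, without touching the others.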

Synthetic biology

AI Accelerates Ability to Program Biology Like Software. The Seattle-based synthetic biology startup Arzeda, co-founded by Alexandre Zanghellini, uses its Intelligent Protein Design Technology to design enzymes and protein sequences. The technology draws on generative AI in combination with a physics-based model.

Drugs

see @allthingsapx

Evo, a genetic foundation model from Arc Institute that learns across the fundamental languages of biology: DNA, RNA and proteins. Is DNA all you need? 

Nathaniel Bennett, a computational biochemist at the University of Washington in Seattle, and colleagues designed antibodies from scratch. They started with RFdiffusion, an AI tool that their team released last year and that has helped to transform protein design. They modified RFdiffusion, which is based on a neural network similar to those used by image-generating AIs such as Midjourney and DALL·E. The team fine-tuned the network by training it on thousands of experimentally determined structures of antibodies attached to their targets, as well as real-world examples of other antibody-like interactions.

References

Alamdari, Sarah, Nitya Thakkar, Rianne Van Den Berg, Alex Xijie Lu, Nicolo Fusi, Ava Pardis Amini, and Kevin K Yang. 2023. “Protein Generation with Evolutionary Diffusion: Sequence Is All You Need.” Preprint. Bioengineering. https://doi.org/10.1101/2023.09.11.556673.
Huang, Kaixuan, Yuanhao Qu, Henry Cousins, William A. Johnson, Di Yin, Mihir Shah, Denny Zhou, Russ Altman, Mengdi Wang, and Le Cong. 2024. “CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments.” arXiv. http://arxiv.org/abs/2404.18021.
Jing, Bowen, Bonnie Berger, and Tommi Jaakkola. 2024. “AlphaFold Meets Flow Matching for Generating Protein Ensembles.” arXiv. http://arxiv.org/abs/2402.04845.